PATENT ABSTRACTS OF JAPAN 



(11) Publication number: 08-161340
(43) Date of publication of application: 21.06.1996

(51) Int.Cl.: G06F 17/28, G06F 17/22, G06F 17/27

(21) Application number: 06-307223
(22) Date of filing: 12.12.1994
(71) Applicant: RICOH CO LTD
(72) Inventor: KATOOKA TAKASHI


(54) AUTOMATIC COMPOUND WORD EXTRACTION DEVICE 

(57)Abstract: 

PURPOSE: To efficiently and automatically collect compound words with a
high degree of cooccurrence, including idioms and compound words formed
by combining words that are not highly specialized.

CONSTITUTION: An N-gram segment device 2 segments word N-grams from an
object document read by an object document input part 1. A frequency
adding-up device 3 totals the appearing frequency of each segmented
N-gram compound word, and a word storage device 4 stores each N-gram
compound word together with the appearing frequency totaled by the
frequency adding-up device 3. A cooccurrence degree calculation device 5
calculates the cooccurrence degree of each N-gram from the appearing
frequencies, in the object document, of the individual words constituting
the N-gram and the appearing frequency of the N-gram itself. A
classification device 6 sorts the information in the word storage device 4
by the cooccurrence degree calculated by the cooccurrence degree
calculation device 5.

JPO and INPIT are not responsible for any damages caused by the use of this translation.

1. This document has been translated by computer, so the translation may not reflect the original precisely.
2. **** shows a word which could not be translated.
3. In the drawings, no words are translated.



CLAIMS 



[Claim(s)] 

[Claim 1] An automatic compound word extraction device characterized by having: an object document input section which reads an object document; an N-gram segmenting device which segments word N-grams (N = 1, 2, 3, ..., Nmax) from the object document read by the object document input section; a frequency totaling device which totals the frequency of occurrence of each N-gram compound word segmented by the segmenting device; a word storage which stores each N-gram compound word and its frequency of occurrence; a cooccurrence degree calculation device which calculates the degree of cooccurrence of each N-gram using the frequency of occurrence in the object document of each word constituting the N-gram and the frequency of occurrence of the N-gram itself; and a classification device which sorts the information in the word storage by the degree of cooccurrence calculated by the cooccurrence degree calculation device.
[Claim 2] The automatic compound word extraction device according to claim 1, characterized by extracting compound words with higher precision using a condition storage, which stores configurations that do not meet the conditions for extraction as vocabulary, and a pattern matching device, which eliminates from the word storage any N-gram matching a stored configuration.






DETAILED DESCRIPTION 



[Detailed Description of the Invention] 
[0001] 

[Industrial Application] This invention relates to an automatic compound word extraction device for language-processing equipment, and more specifically to a device that automatically and efficiently collects compound words from an object document. It is applicable, for example, to vocabulary dictionary creation devices for machine translation systems and word processors.
[0002] 

[Description of the Prior Art] A known reference describing conventional language-processing equipment is, for example, JP,6-19968,A. To make it easy to extract technical terms from a huge number of words and to build a technical-term dictionary in a short time, the device of that publication divides an input sentence into words with a word division device and normalizes them, attaching part-of-speech information. The normalized input data is output to a technical-term judging device, which evaluates each word while referring to its dictionaries and extracts technical-term candidates according to this evaluation. However, the technical-term judgment takes into account the number of constituent words, the usage frequency of the constituent words, a vocabulary dictionary classified by field, and the character type (katakana words), so a vocabulary dictionary classified by field is required. Moreover, that publication does not describe how the technical-term candidates to be judged are selected.
[0003] 

[Problem(s) to be Solved by the Invention] In a machine translation system, for example, when text from a specific field is translated, translation performance depends greatly on how much of that field's vocabulary has been registered in the dictionary in advance. However, the conventional technical-term detection approach based on unknown-word retrieval had the drawback that technical terms consisting of two or more words, such as idioms and compound words, could not be extracted efficiently. Moreover, the device of the above publication collected vocabulary using a vocabulary dictionary classified by field, which had the drawback of making the equipment large.
[0004] This invention was made in view of these circumstances. Its aim is to provide an automatic compound word extraction device that, even for idioms and compound words formed by combining words that are not highly specialized, discerns whether an entry consisting of two or more words genuinely cooccurs or is connected merely by chance, and thereby efficiently and automatically collects compound words with a strong degree of cooccurrence.
[0005] 

[Means for Solving the Problem] To solve the above problem, this invention (1) comprises: an object document input section which reads an object document; an N-gram segmenting device which segments word N-grams (N = 1, 2, 3, ..., Nmax) from the object document read by the object document input section; a frequency totaling device which totals the frequency of occurrence of each N-gram compound word segmented by the segmenting device; a word storage which stores each N-gram compound word and its frequency of occurrence; a cooccurrence degree calculation device which calculates the degree of cooccurrence of each N-gram using the frequency of occurrence in the object document of each word constituting the N-gram (the N = 1 case) and the frequency of occurrence of the N-gram itself; and a classification device which sorts the information in the word storage by the degree of cooccurrence calculated by the cooccurrence degree calculation device. Further, (2) it is characterized by extracting compound words with higher precision using a condition storage, which stores configurations that do not meet the conditions for extraction as vocabulary, and a pattern matching device, which eliminates from the word storage any N-gram matching a stored configuration.

[0006]

[Function] With configuration (1), the automatic compound word extraction device of this invention inputs an object document, segments word N-grams (N = 1, 2, 3, ..., Nmax) from it, totals the frequency of occurrence of each segmented N-gram compound word, and stores each N-gram compound word with its frequency of occurrence. It then calculates the degree of cooccurrence of each N-gram using the frequency of occurrence in the object document of each word constituting the N-gram (the N = 1 case) and the frequency of occurrence of the N-gram itself, and sorts the stored information by that degree. Because the strength of cooccurrence among the constituent words of a compound word can be obtained from the frequencies of occurrence of the constituent words and of the compound word, words that appear in the input text as strongly cooccurring compound words can be extracted quickly, efficiently, and automatically with simple equipment and without using a dictionary. With configuration (2), configurations that do not meet the conditions for extraction as vocabulary are stored, and a pattern matching device eliminates N-grams matching the stored configurations. By storing in the condition storage beforehand the N-gram patterns considered unsuitable for extraction as vocabulary among the words extracted in (1), candidates matching these patterns can be eliminated and vocabulary can be extracted with higher precision.
[0007] 

[Example] An example is explained below with reference to the drawings. First, N-gram compound words (N = 1, 2, 3, ..., Nmax) are extracted from an object document. If the object document is English, the text is divided into individual words by referring to space characters and other morphological features of the language. For 3-grams (N = 3), sequences of up to 3 connected words are segmented. The frequency of occurrence of each extracted N-word sequence is counted and totaled by the frequency totaling device. The frequency of occurrence of each individual word is also counted and totaled. The results are stored in the N-gram word storage.

[0008] After the frequencies of occurrence for the input document have been totaled, the degree of cooccurrence of each N-gram compound word in the N-gram word storage is calculated according to the following formulas. Let the N connected words, i.e. the constituent words of a compound word, be w1, w2, w3, ..., wN. Their individual frequencies of occurrence are written H(w1), H(w2), H(w3), ..., H(wN), and the frequency of occurrence of the N-word sequence itself is written H(w1, w2, w3, ..., wN). The total number of words in the object input document is A.
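[0009]
[Equation 1]

$$\frac{P(w_1, w_2, w_3, \ldots, w_N)}{P(w_1) \times P(w_2) \times P(w_3) \times \cdots \times P(w_N)} \qquad (1)$$

$$P(w_1) = \frac{H(w_1)}{A}, \quad P(w_2) = \frac{H(w_2)}{A}, \quad P(w_3) = \frac{H(w_3)}{A}, \quad \ldots, \quad P(w_N) = \frac{H(w_N)}{A}$$

$$P(w_1, w_2, w_3, \ldots, w_N) = \frac{H(w_1, w_2, w_3, \ldots, w_N)}{A - (N - 1)} \qquad (2)$$
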

[0010] The denominator of formula (1) expresses the probability that the words become connected merely by chance, computed from the appearance probability of each word constituting the compound word. The numerator of formula (1) is the probability that the words actually appear connected. Formula (1) is therefore the ratio of the probability that a given compound word actually occurs to the probability that its words are connected by chance. The higher the value of formula (1), the more strongly the words of the N-gram cooccur; conversely, when the value is low, the words do not so much cooccur as happen to be connected by chance.
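
As a rough illustration of formulas (1) and (2), the following Python sketch computes the degree of cooccurrence from raw counts. The function name `cooccurrence_degree` and its parameter names are hypothetical and do not come from the patent. The sketch also assumes that formula (2) divides the N-gram count by A - (N - 1), the number of N-word windows in a document of A words.

```python
from math import prod


def cooccurrence_degree(ngram_count, word_counts, total_words):
    """Ratio of the observed N-gram probability to the chance probability.

    ngram_count : H(w1, ..., wN), occurrences of the whole N-gram
    word_counts : [H(w1), ..., H(wN)], occurrences of each constituent word
    total_words : A, total number of words in the object document
    """
    n = len(word_counts)
    if n < 2:
        # The description does not compute the degree for N = 1.
        raise ValueError("the degree is only defined for N >= 2")
    # Formula (2): observed probability of the N-gram (assumed denominator).
    p_ngram = ngram_count / (total_words - (n - 1))
    # Denominator of formula (1): probability that the words meet by chance.
    p_chance = prod(h / total_words for h in word_counts)
    return p_ngram / p_chance
```

For a two-word sequence that occurs once in a six-word document whose words each occur once, `cooccurrence_degree(1, [1, 1], 6)` evaluates to (1/5) / (1/36), approximately 7.2.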

[0011] Drawing 1 is a block diagram for explaining one example (Example 1) of the automatic compound word extraction device of this invention. In the drawing, 1 is the object document input section, 2 is the N-gram segmenting device, 3 is the N-gram frequency totaling device, 4 is the N-gram word storage (frequency-of-occurrence storage), 5 is the cooccurrence degree calculation device, and 6 is the classification device.
[0012] The object document is read by the object document input section 1, and the N-gram segmenting device 2 segments word N-grams (N = 1, 2, 3, ..., Nmax) from it. The frequency totaling device 3 totals the frequency of occurrence of each N-gram compound word segmented by the N-gram segmenting device 2, and the word storage 4 stores each N-gram compound word together with the frequency of occurrence totaled by the frequency totaling device 3.
[0013] The cooccurrence degree calculation device 5 calculates the degree of cooccurrence of each N-gram using the frequency of occurrence in the object document of each word constituting the N-gram (the N = 1 case) and the frequency of occurrence of the N-gram itself. The classification device 6 sorts the information in the word storage 4 by the degree of cooccurrence calculated by the cooccurrence degree calculation device 5. In this way, even for idioms and compound words formed by combining words that are not highly specialized, the device can discern whether an entry consisting of two or more words genuinely cooccurs or is connected merely by chance, and can efficiently and automatically collect compound words with a strong degree of cooccurrence.
[0014] Drawing 2 and Drawing 3 are flow charts for explaining the operation of the automatic compound word extraction device of this invention. The steps (S) are explained in order below. First, variable i and variable j are set to 1 (S1), and it is judged whether the value of variable j exceeds N, the maximum number of connected words in a compound word (S2). If it does not exceed N, the j words starting from the i-th word from the head of the text are input from the object document input section 1 and stored in the variable words (S3). Next, it is judged whether the (i + j - 1)-th word exists (S4). If it exists, it is judged whether the word sequence in words already exists in the frequency-of-occurrence storage 4 (S5). If it does not, the contents of words are stored in the frequency-of-occurrence storage 4 with an appearance count of 1 (S6), variable j is incremented by 1 (S7), and processing returns to step S2.

[0015] If, in step S5, the word sequence in words already exists in the frequency-of-occurrence storage 4, the appearance count stored for the contents of words is incremented by 1 (S8), and processing goes to step S7. If, in step S2, the value of variable j exceeds N, the maximum number of connected words, j is set to 1 and i is incremented by 1 (S9), and processing goes to step S3. If, in step S4, the (i + j - 1)-th word does not exist, it is judged whether j is equal to 1 (S10). If it is not, processing goes to step S9; if it is, i - 1 is set in the variable A, which holds the total number of words of the object document (S11).
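
The counting loop of steps S1 to S11 might be sketched in Python as follows. This is only one reading of the flow charts: the function name `count_ngrams` is hypothetical, the document is assumed to be already split into a list of words, and the S10 test is taken to be j = 1, which is what lets the loop stop once the i-th word itself no longer exists.

```python
def count_ngrams(words, n_max):
    """Count every sequence of 1..n_max consecutive words (steps S1 to S11).

    Returns the counts and A, the total number of words in the document.
    """
    counts = {}                       # frequency-of-occurrence storage 4
    i, j = 1, 1                       # S1 (1-based, as in the flow charts)
    while True:
        if j > n_max:                 # S2: maximum length exceeded
            i, j = i + 1, 1           # S9: move to the next start word
        if i + j - 1 <= len(words):   # S4: does word (i + j - 1) exist?
            seq = tuple(words[i - 1:i - 1 + j])    # S3: j words from word i
            counts[seq] = counts.get(seq, 0) + 1   # S5 with S6 or S8
            j += 1                    # S7
        elif j == 1:                  # S10: even the i-th word is missing
            return counts, i - 1      # S11: A = i - 1
        else:
            i, j = i + 1, 1           # S9
```

Calling `count_ngrams("The orchestra gave him superb support.".split(), 3)` stores six single words, five two-word sequences, and four three-word sequences, and returns A = 6. Splitting on whitespace leaves the final period attached to "support."; a real tokenizer would probably handle punctuation separately.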

[0016] Next, the degree of cooccurrence is calculated for each compound word of up to N words stored in the frequency-of-occurrence storage 4, and the result is stored in the frequency-of-occurrence storage 4 (S12). The classification device 6 then sorts the information stored in the frequency-of-occurrence storage 4 in descending order of the degree of cooccurrence (S13).

[0017] Example 1 of this invention is explained concretely below using a sample text. The following text is input from the object document input section.
[0018]
[Table 1]

Input text:  The  orchestra  gave  him  superb  support.
Position:    1    2          3     4    5       6

[0019] The automatic extraction of compound words of up to 3 words (3-grams) is explained. First, the initial value 1 is set in variables i and j (S1). The object document is read from the object document input section 1 and, since variable j does not exceed the maximum of 3 connected words (S2), one word (j = 1) is obtained from the head (i = 1) (S3). That is, "The" is obtained. Since the word at position i + j - 1 = 1 exists (S4), it is checked whether this word is already stored in the N-gram word storage 4 (S5). Since it is not yet stored, the word "The" is newly stored in the N-gram word storage 4 with an appearance count of 1 (S6). Had it already been stored, its appearance count would be incremented by 1 (S8). Then j is incremented by 1 (S7), and two words (j = 2) are obtained from the head (i = 1).
[0020] That is, "The orchestra" is obtained. Since the word at position i + j - 1 = 2 exists (S4), it is checked whether this word sequence is stored in the N-gram word storage 4. Since it is not yet stored, it is newly stored with an appearance count of 1. For j = 3, "The orchestra gave" is obtained in the same way. Incrementing j by 1 (S7) gives j = 4. Since j now exceeds the maximum of 3 connected words, j is set to 1 and i is incremented by 1 to become 2 (S9).

[0021] Next, one word (j = 1) is obtained starting from the 2nd word (i = 2), giving "orchestra". Segmenting compound words as in the case of i = 1 while incrementing j up to the maximum of 3 yields "orchestra gave" and "orchestra gave him", which are stored in the N-gram word storage together with their appearance counts. This processing is repeated up to i = 6. Finally, the contents shown in Drawing 4 are stored in the N-gram word storage 4. Then the degree of cooccurrence of each compound word is calculated from the frequencies of occurrence according to formula (1), and the results are stored in the word storage 4 (S12). The entries are then sorted in descending order of the degree of cooccurrence.

[0022] For example, when 1 million object input sentences are prepared and compound words of up to 3 words are extracted, the N-gram word storage 4 after the degree of cooccurrence has been calculated looks like Drawing 5. The degree of cooccurrence is not computed for N = 1. The entries are sorted by the degree of cooccurrence and a threshold is then decided, for example by taking the top 50% as candidates; alternatively, entries with the same number of constituent words are sorted together by the degree of cooccurrence. Any sorting method, such as quicksort, merge sort, or a simple sort, may be used. By setting a threshold on the degree of cooccurrence for the compound words of each length and extracting those at or above it, compound words with a strong degree of cooccurrence can be extracted automatically.
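
The per-length selection described in [0022] could look roughly like the following Python sketch. The name `select_candidates`, the `keep_ratio` parameter, and the rule of keeping at least one entry per length are illustrative assumptions; the description only gives "top 50%" as one possible threshold.

```python
def select_candidates(degrees, keep_ratio=0.5):
    """Keep the highest-scoring share of N-grams, separately for each length.

    degrees : dict mapping an N-gram (tuple of words) to its degree of
              cooccurrence
    """
    by_length = {}
    for ngram, degree in degrees.items():
        by_length.setdefault(len(ngram), []).append((degree, ngram))
    selected = {}
    for n, scored in by_length.items():
        scored.sort(reverse=True)     # descending degree of cooccurrence
        # Assumption: always keep at least one candidate per length.
        keep = max(1, int(len(scored) * keep_ratio))
        selected[n] = [ngram for _, ngram in scored[:keep]]
    return selected
```

Python's built-in sort is used here for brevity; the description leaves the choice of sorting method open.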

[0023] Drawing 6 is a block diagram for explaining another example (Example 2) of the automatic compound word extraction device of this invention. In the drawing, 7 is the pattern matching device and 8 is the condition storage; parts that operate in the same way as in Drawing 1 bear the same reference signs.
[0024] The object document is read by the object document input section 1, and the N-gram segmenting device 2 segments word N-grams (N = 1, 2, 3, ..., Nmax) from it. The frequency totaling device 3 totals the frequency of occurrence of each N-gram compound word segmented by the N-gram segmenting device 2, and the word storage 4 stores each N-gram compound word together with the frequency of occurrence totaled by the frequency totaling device 3.
[0025] The cooccurrence degree calculation device 5 calculates the degree of cooccurrence of each N-gram using the frequency of occurrence in the object document of each word constituting the N-gram (the N = 1 case) and the frequency of occurrence of the N-gram itself. The classification device 6 sorts the information in the word storage 4 by the degree of cooccurrence calculated by the cooccurrence degree calculation device 5. In this way, even for idioms and compound words formed by combining words that are not highly specialized, the device can discern whether an entry consisting of two or more words genuinely cooccurs or is connected merely by chance, and can efficiently and automatically collect compound words with a strong degree of cooccurrence.
[0026] The condition storage 8 stores configurations that do not meet the conditions for extraction as vocabulary, and the pattern matching device 7 eliminates N-grams matching the configurations stored in the condition storage 8. By using these, compound words can be extracted with higher precision. That is, in Example 2, a configuration that does not meet the extraction conditions is a pattern that can hardly form a compound word. Examples are a two-word sequence consisting of an article and one other word, and a two-word sequence consisting of a pronoun (English his, my, your, their, them, him, etc.) and another word.
[0027] A concrete case of Example 2 is shown below. Table 2 is the case where the number of connected words is limited to 2 and the first words of two-word sequences are stored in the device, so that a sequence is eliminated from the vocabulary to be extracted when its first word is "the", "a", "an", "his", and so on. Pattern matching is performed by string comparison in the pattern matching device 7.
[0028]
[Table 2]

Number of connected words   First word
2                           the
2                           a
2                           an
2                           his
2                           her
2                           their
2                           my
2                           your


[0029] Table 3 is the case where the number of connected words is limited to 2 and the first and second words of two-word sequences are stored in the device, so that a compound word is eliminated from the vocabulary to be extracted when its second word is "the", "a", or "an" and its first word is "in", "of", "with", "from", or "to". Pattern matching is performed by string comparison in the pattern matching device 7.



[0030]
[Table 3]

Number of connected words   First word    Second word
2                           in            the
                            of            a
                            with          an
                            for
                            [illegible]
                            from
                            to
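
The pattern matching of Example 2 can be illustrated with a short Python sketch. The word sets below stand in for the contents of the condition storage 8 and are copied from Tables 2 and 3, leaving out the one entry that is unreadable in the source. The names `is_excluded` and `filter_candidates`, the lowercasing of words before comparison, and the reading of Table 3 as "any listed first word followed by any listed article" are assumptions made for the illustration.

```python
# Hypothetical contents of the condition storage 8 (from Tables 2 and 3).
EXCLUDED_FIRST_WORDS = {"the", "a", "an", "his", "her", "their", "my", "your"}
PREPOSITIONS = {"in", "of", "with", "for", "from", "to"}
ARTICLES = {"the", "a", "an"}


def is_excluded(ngram):
    """Return True when a 2-word N-gram matches a stored pattern."""
    if len(ngram) != 2:
        return False
    # Assumption: compare case-insensitively so that "The" matches "the".
    first, second = (word.lower() for word in ngram)
    if first in EXCLUDED_FIRST_WORDS:                      # Table 2
        return True
    return first in PREPOSITIONS and second in ARTICLES    # Table 3


def filter_candidates(ngrams):
    """Drop candidates that match a pattern; keep the rest in order."""
    return [ngram for ngram in ngrams if not is_excluded(ngram)]
```

With this sketch, ("The", "orchestra") and ("of", "the") are dropped, while ("orchestra", "gave") and ("superb", "support") are kept as vocabulary candidates.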



[0031] 

[Effect of the Invention] As is clear from the above explanation, this invention provides the following effects.

(1) Effect corresponding to claim 1: because the strength of cooccurrence among the constituent words of a compound word can be obtained from the frequencies of occurrence of the constituent words and of the compound word, words that appear in the input text as strongly cooccurring compound words can be extracted quickly, efficiently, and automatically with simple equipment and without using a dictionary.

(2) Effect corresponding to claim 2: because the N-gram patterns considered unsuitable for extraction as vocabulary among the words extracted in (1) are stored beforehand in the condition storage, candidates matching these patterns can be eliminated and vocabulary can be extracted with higher precision.
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TECHNICAL FIELD 



[Industrial Application] This invention relates to the copula automatic extracting equipment for carrying out automatic 
collection of the copula efficiently fi"om an object document at a detail more about the copula automatic extracting 
equipment in language-processing equipment. For example, it is applied to vocabulary dictionary listing devices, such 
as machine translation and a word processor. 
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PRIOR ART 



[Description of the Prior Art] As well-known reference which indicated conventional language-processing equipment, 
there is JP,6-19968,A, for example. In order for the thing of this official report to enable it to extract a technical term 
easily out of a huge word and to enable it to build a technical-term dictionary easily for a short time An input statement 
is divided into a word with word division equipment, and normalization of part-of-speech information being given is 
performed. The input data which it normalized with word division equipment is outputted to technical-term judging 
equipment, while this technical-term judging equipment refers said each dictionary, evaluation of each word is 
performed, and the candidate of a technical term is extracted according to this evaluation. However, a technical-term 
judging is performed in consideration of the number of configuration words, the operating firequency of a configuration 
word, the vocabulary dictionary classified by field, and a type of letters (katakana word), and the vocabulary dictionary 
classified by field is needed. Moreover, the thing of said official report does not have the description about the 
technical-term candidate selection for a judgment. 
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EXAMPLE 



[Example] An example is explained below with reference to a drawing. First, the connection word of N-gram is 
extracted from an object document (N= 1, 2 and 3, — , Nmax). If it is the case where an object document is English, it 
will refer to a space character etc. from the gestalt-description of language, and every word will be divided. If it is 3- 
gram of N= 3, the vocabulary which carries out a maximum of 3 word connection will be started. Extracted N 
connection word carries out the count total of the frequency of occurrence with the equipment which totals the 
frequency of occurrence. Moreover, the frequency of occurrence for every word is counted, and it totals. This result is 
memorized by N-gram word storage. 

[0008] After the total of the frequency of occurrence to an input-statement document finishes, count of whenever 
[ coincidence ] is calculated according to the following formulas to the connection word of N-gram in N-gram word 
storage. When the words of N connection, i.e., the configuration word of a copula, are wl, w2, w3, ~, wN, 
respectively, the frequency of occurrence of the N connection word itself expresses [ each frequency of occurrence ] 
with H (wl), H (w2), -, H (w3) H (wl, w2, w3, wN). Moreover, the total number of words of an object input- 
statement document is set to A. 
[0009] 
[Equation 1] 

P(w],w2, w3, .... wN) 

(1) 

P(w 1 ) X P(w2) X P(w3) X . .. XP(wN) 



P(wl)= 
P(w2)= 

P(w3)= 



H(wl) 

A 
H(w2) 

A 

H(w3) 



vn H(WN) 

A 



P(wl.w2.w3 wN) = (2) 



H(wI,w2,w3>...wN) 
A 



[0010] (1) The denominator of a formula expresses the probability which each word connects by chance from the 
appearance probability of each word which constitutes a copula. (1) The molecule of a formula is the probability for 
each word to be connected and to actually appear. Therefore, (1) type serves as a ratio of the probability which a certain 
copula actually connects, and the probability connected [ chance ]. (1) It can be said that the connection word of the N- 
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gram has the high degree which coincides and appears, so that the value of a formula is high, conversely, when low, it 
-coincides — as — ** — possibility of connecting by chance is high. 

[001 1] Drawing 1 is a block diagram for explaining one example (example 1) of the copula automatic extracting 
equipment by this invention, and, for N-gram logging equipment and 3, as for N-gram word storage (frequency-of- 
" occurrence storage) and 5, the frequency total equipment of N-gram and 4 are [ one / the object dociunent input section 
and 2 / count equipment and 6 ] classification equipment whenever [ coincidence ] among drawing. 
[0012] N-gram (N= 1, 2 and 3, Nmax) of a word is started with N-gram logging equipment 2 from the object 
document which read the object document from the object docvunent input section 1, and was read from this object 
document input section 1. The frequency of occurrence of the connection word of N-gram started by this N-gram 
logging equipment 2 is totaled with frequency total equipment 3, and the frequency of occurrence totaled by said 
frequency total equipment 3 is remembered to be a N-gram connection word with the word storage 4. 
[0013] Whenever [ coincidence / of N-gram ] is calculated with count equipment 5 whenever [ coincidence ] using the 
frequency of occurrence in the object document of each word which constitutes said N-gram (in the case of N= 1), and 
the frequency of occurrence of N-gram itself. Classification equipment 6 puts in order and changes the information in 
the word storage 4 with the value of whenever [ coincidence / which was calculated vsdth count equipment 5 whenever / 
said coincidence ]. Thus, it can discern whether they are whether the entry which consists of two or more words is 
coincided also to the idiom and copula by combination of the word which is not a high word of an expert, and the 
connection depended by chance, and the efficient degree of coincidence can collect strong copulas automatically. 
[0014] Drawing 2 and drawing 3 are the flow charts for explaining actuation of the copula automatic extracting 
equipment by this invention. Hereafter, according to each step (S), it explains in order. First, Variable i and Variable j 
are set to 1 (SI), and it judges whether the value whose value of Variable j is the number N of the maximum copula 
connection was exceeded (S2). If it is not over the value of N next, from the object document input section 1 , j word is 
inputted from i word eye from the head of the text, and it stores in Variable words (S3). Next, if it judges whether the 
word of eye watch (i+j -1) exists and (S4) and a word exist next, it will judge whether the word train in words already 
exists in the frequency-of-occurrence storage 4 (S5). If it does not exist, the contents of words are memorized as a 
count 1 of an appearance to the frequency-of-occurrence storage 4 (S6), only 1 counts up Variable j (S7), and it returns 
to said step S2. 

[0015] In said step S5, if the word train in words has already existed in the frequency-of-occurrence storage 4, only 1 
will count up the count of an appearance of the contents of words memorized by the frequency-of-occurrence storage 4 
(S8), and it will go to said step S7. In said step S2, if the value whose value of Variable j is the munber N of the 
maximum copula connection is exceeded, 1 will be set to j, i will be counted up one time, and it will go to (S9) and said 
step S3. In said step S4, it judges whether if the word of eye watch (i+j -1) does not exist next, j is equal to i (SIO), and 
if not equal, if equal, i-1 will be set to going and the variable A which memorizes the total number of words of an 
object document to said step S9 (SI 1). 

[0016] Next, whenever [ coincidence / of the copula of the amount size N individual memorized by the frequency-of- 
occurrence storage 4 ] is calculated. The information which memorized the result to the frequency-of-occurrence 
storage 4 (SI 2), and was memorized by the frequency-of-occurrence storage 4 with classification equipment 6 is 
changed together with the high order of whenever [ coincidence ] (SI 3). 

[0017] Hereafter, the example 1 of this invention is concretely explained based on an example. The text is inputted 
from the object document input section. 
[0018] 
[Table 1] 

The orchestra gave him superb support. -< A>3 A 

(I51r*l 2 3 4 5 6 ) 

[0019] An example of automatically extracting compound words of up to 3-grams is explained. First, the initial value 1 
is set in the variables i and j (S1). The object document is read from the object document input section 1, and since 
variable j does not exceed the maximum connection count of 3 (S2), one word (j=1) is obtained starting from the head 
(i=1) (S3). That is, "The" is obtained. Since a word exists at position i+j-1=1 (S4), it is checked whether this word 
is already stored in the N-gram word storage 4 (S5). Since it is not yet stored, the word "The" is newly stored in the 
N-gram word storage 4 with an occurrence count of 1 (S6). If it had already been stored, its occurrence count would be 
incremented by 1 (S8). Then j is incremented by 1 (S7), and two words (j=2) are obtained starting from the head (i=1). 
[0020] That is, "The orchestra" is obtained. Since a word exists at position i+j-1=2 (S4), it is checked whether this 
word string is stored in the N-gram word storage 4. Since it is not yet stored, it is newly stored and its occurrence 
count is set to 1. At j=3, "The orchestra gave" is obtained in the same way. Incrementing j by 1 (S7) gives j=4. Since 
j now exceeds the maximum connection count of 3, j is set to 1 and i is incremented by 1, giving i=2 (S9). 

[0021] Next, one word (j=1) is obtained starting from the 2nd word (i=2): "orchestra". When compound words are 
extracted as in the case of i=1, incrementing j up to a maximum of 3, "orchestra gave" and "orchestra gave him" are 
extracted and stored in the N-gram word storage together with their occurrence counts. This processing is repeated up 
to i=6. Finally, the contents shown in Drawing 4 are stored in the N-gram word storage 4. Then, according to formula 
(1), the degree of cooccurrence of each compound word is calculated from the frequencies of occurrence, and the result 
is stored in the word storage 4 (S12). Furthermore, the entries are sorted in descending order of the degree of 
cooccurrence. 
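Steps S12 and S13 can be illustrated with a short sketch. Formula (1) is not restated in this passage, so the sketch assumes a ratio form, f(w1..wN) * A^(N-1) / (f(w1) * ... * f(wN)), where A is the total number of words. That form is an assumption, chosen because with A = 1,000,000 it reproduces the Drawing 5 values for "the khmer rouge" and "khmer rouge". The function name `cooccurrence` and the value of `A` are likewise hypothetical.

```python
from math import prod


def cooccurrence(ngram_count, word_counts, total_words):
    """Assumed form of the degree of cooccurrence (not confirmed as formula (1)):
    f(w1..wN) * A**(N-1) / (f(w1) * ... * f(wN))."""
    n = len(word_counts)
    return ngram_count * total_words ** (n - 1) / prod(word_counts)


A = 1_000_000  # assumed total number of words in the object document

# Frequencies taken from Drawing 5: the=116415, khmer=14, rouge=24
rows = {
    "the khmer rouge": cooccurrence(12, [116415, 14, 24], A),
    "khmer rouge": cooccurrence(14, [14, 24], A),
    "the khmer": cooccurrence(12, [116415, 14], A),
}

# S13: rearrange in descending order of the degree of cooccurrence
ranked = sorted(rows.items(), key=lambda kv: kv[1], reverse=True)
```

Under this assumption a very frequent word such as "the" pulls the score of "the khmer" down, while "khmer rouge" scores high because its two words almost always appear together.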

[0022] For example, when 1 million object input sentences are prepared and compound words of up to 3 words are 
extracted, the N-gram word storage 4 after the degree of cooccurrence has been calculated looks like Drawing 5. The 
degree of cooccurrence is not calculated for the case of N=1. As the sorting approach, the entries may be sorted by the 
magnitude of the degree of cooccurrence and a threshold decided, for example taking the top 50% as candidates, or 
entries with the same number of connected words may be sorted among themselves by the degree of cooccurrence. 
Quicksort, merge sort, a simple sorting method, and the like can be used. By setting a threshold on the degree of 
cooccurrence of the compound words extracted for each length and keeping those at or above it, compound words with a 
strong degree of cooccurrence can be extracted automatically. 
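One way to read the selection described above is to group entries by number of connected words, sort each group by the degree of cooccurrence, and keep the top fraction. The sketch below assumes that reading. The function name `select_candidates`, the `top_fraction` parameter, and the rule of keeping at least one entry per group are illustrative choices, and the sample values are taken from Drawing 5.

```python
def select_candidates(entries, top_fraction=0.5):
    """entries: list of (ngram_tuple, degree_of_cooccurrence).
    Returns {length: top entries of that length, highest degree first}."""
    by_length = {}
    for ngram, degree in entries:
        if len(ngram) >= 2:  # the degree is not calculated for N = 1
            by_length.setdefault(len(ngram), []).append((ngram, degree))
    selected = {}
    for n, group in by_length.items():
        group.sort(key=lambda e: e[1], reverse=True)  # any sorting method would do
        keep = max(1, int(len(group) * top_fraction))  # e.g. the top 50%
        selected[n] = group[:keep]
    return selected


entries = [
    (("the",), 0.0),  # a 1-gram, ignored by the selection
    (("the", "khmer"), 7),
    (("khmer", "rouge"), 41667),
    (("brings", "fannie"), 1050),
    (("calls", "seeking"), 96),
    (("securities", "brings", "fannie"), 396920),
    (("calls", "seeking", "comment"), 138985),
]
selected = select_candidates(entries)
```

A fixed numeric threshold per length could replace `top_fraction` without changing the rest of the sketch.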

[0023] Drawing 6 is a block diagram for explaining another example (Example 2) of the automatic compound word 
extraction equipment according to this invention. In the drawing, 7 is the pattern match equipment and 8 is the 
conditioning store; parts that perform the same operation as in Drawing 1 carry the same signs. 
[0024] The object document is read from the object document input section 1, and the N-gram logging equipment 2 
extracts N-grams (N = 1, 2, 3, ..., Nmax) of words from the object document that was read in. The frequency of 
occurrence of each N-gram compound word extracted by the N-gram logging equipment 2 is totaled by the frequency total 
equipment 3, and the N-gram compound words and the frequencies of occurrence totaled by the frequency total equipment 3 
are stored in the word storage 4. 
[0025] The degree-of-cooccurrence count equipment 5 calculates the degree of cooccurrence of each N-gram using the 
frequency of occurrence in the object document of each word that constitutes the N-gram (the case of N=1) and the 
frequency of occurrence of the N-gram itself. The classification equipment 6 rearranges the information in the word 
storage 4 by the value of the degree of cooccurrence calculated by the degree-of-cooccurrence count equipment 5. In 
this way, even for idioms and compound words formed from words that are not highly specialized, it is possible to 
distinguish whether an entry consisting of two or more words genuinely cooccurs or is merely connected by chance, and 
compound words with a strong degree of cooccurrence can be collected efficiently and automatically. 
[0026] The conditioning store 8 stores configurations that do not meet the conditions for extraction as vocabulary, and 
the pattern match equipment 7 eliminates N-grams that match a configuration stored in the conditioning store 8. By 
using these, compound words can be extracted with still better precision. That is, in Example 2, a configuration that 
does not meet the extraction condition as vocabulary means a pattern that can hardly form a compound word. Examples 
include a 2-word connection made up of an article and one other word, and a 2-word connection made up of a pronoun (in 
English: his, my, your, their, them, him, etc.) and another word. 
[0027] A concrete example of Example 2 is shown below. Table 2 is an example in which the number of connected words is 
limited to 2 and the first words of the 2-word connections to be excluded are stored in the equipment: when the first 
word is "the", "a", "an", "his", or the like, the compound word is eliminated from the vocabulary to be extracted. The 
pattern match is performed by string comparison in the pattern match equipment 7. 
[0028] 

[Table 2] 

2 the 
2 a 
2 an 
2 his 
2 her 
2 their 
2 my 
2 your 



[0029] Moreover, Table 3 is an example in which the number of connected words is limited to 2 and both the second word 
and the first word of the 2-word connection are stored in the equipment: a compound word whose second word is "the", 
"a", or "an" and whose first word is "in", "of", "with", "from", or "to" is eliminated from the vocabulary to be 
extracted. The pattern match is performed by string comparison in the pattern match equipment 7. 

[0030] 

[Table 3] 

Number of connected words: 2 
Second word: the, a, an 
First word: in, of, with, for, on, from, to 
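The exclusion step of Example 2 can be sketched as a string-comparison filter over 2-word connections. This is an illustration under stated assumptions: the names `EXCLUDED_FIRST_WORDS`, `EXCLUDED_PAIRS`, `is_excluded`, and `filter_ngrams` are hypothetical, the comparison is made case-insensitive for convenience, and the Table 3 first words are limited to those named in paragraph [0029].

```python
# Patterns modeled on Table 2: 2-word connections excluded by their first word
EXCLUDED_FIRST_WORDS = {"the", "a", "an", "his", "her", "their", "my", "your"}

# Patterns modeled on Table 3: (first word, second word) pairs to exclude
EXCLUDED_PAIRS = {
    (first, second)
    for first in ("in", "of", "with", "from", "to")
    for second in ("the", "a", "an")
}


def is_excluded(ngram):
    """Return True if a 2-word connection matches a stored exclusion pattern."""
    if len(ngram) != 2:      # both tables limit the number of connected words to 2
        return False
    first, second = (w.lower() for w in ngram)  # lowercasing is an assumption
    return first in EXCLUDED_FIRST_WORDS or (first, second) in EXCLUDED_PAIRS


def filter_ngrams(ngrams):
    """Keep only the N-grams that do not match any exclusion pattern."""
    return [g for g in ngrams if not is_excluded(g)]


candidates = [
    ("The", "orchestra"),
    ("orchestra", "gave"),
    ("of", "the"),
    ("khmer", "rouge"),
    ("the", "khmer", "rouge"),
]
kept = filter_ngrams(candidates)
```

Because the patterns are restricted to 2-word connections, a 3-gram such as "the khmer rouge" passes through even though it begins with an article.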
DESCRIPTION OF DRAWINGS 



[Brief Description of the Drawings] 

[Drawing 1] A block diagram for explaining one example of the automatic compound word extraction equipment according to 
this invention. 

[Drawing 2] A flow chart (part 1) for explaining the operation of the automatic compound word extraction equipment 
according to this invention. 

[Drawing 3] A flow chart (part 2) for explaining the operation of the automatic compound word extraction equipment 
according to this invention. 

[Drawing 4] A drawing showing an example of the contents stored in the N-gram word storage in this invention. 

[Drawing 5] A drawing showing another example of the contents stored in the N-gram word storage in this invention. 

[Drawing 6] A block diagram for explaining another example of the automatic compound word extraction equipment 
according to this invention. 

[Description of Notations] 

1: object document input section 
2: N-gram logging equipment 
3: frequency total equipment for N-grams 
4: N-gram word store 
5: degree-of-cooccurrence count equipment 
6: classification equipment 
7: pattern match equipment 
8: conditioning store 
DRAWINGS 

[Drawing 1] 
[Drawing 3] 
[Drawing 4] 
[Drawing 5] 



N-gram                        Frequency   Degree of cooccurrence   Constituent words (frequency) 
securities brings fannie      5           396920                   securities (1654), brings (64), fannie (119) 
the khmer rouge               12          306784                   the (116415), khmer (14), rouge (24) 
calls seeking comment         10          138985                   calls (335), seeking (342), comment (628) 
premium over yesterday's      6           67823                    premium (165), over (2437), yesterday's (220) 
public employees retirement   6           47793                    public (1082), employees (586), retirement (198) 

N-gram                        Frequency   Degree of cooccurrence 
securities brings             8           758 
brings fannie                 8           1050 
the khmer                     12          7 
khmer rouge                   14          41667 
calls seeking                 11          96 
seeking comment               11          51 
premium over                  28          70 
over yesterday's              8           8 
public employees              10          15 
employees retirement          8           69 



[Drawing 6] 

[Drawing 2] 






(Flow chart; legible step labels: S1, S7, S8, S9, S10; branch label: yes.) 



[Translation done.] 



http://www4.ipdl.ncipi.go.jp/cgi-bin/tran_web_cgi_ejje 



1/9/2007 



